50 research outputs found

    TurboMGNN: Improving concurrent GNN training tasks on GPU with fine-grained kernel fusion

    Graph Neural Networks (GNNs) have evolved into powerful models for graph representation learning, and many works have been proposed to support efficient GNN training on GPU. However, these works focus only on a single GNN training task, addressing aspects such as operator optimization, task scheduling, and programming models. Concurrent GNN training, which is needed in applications such as neural architecture search, has not been explored yet. This work aims to improve the training efficiency of concurrent GNN training tasks on GPU by developing fine-grained methods to fuse kernels from different tasks. Specifically, we propose a fine-grained Sparse Matrix Multiplication (SpMM)-based kernel fusion method that eliminates redundant accesses to graph data. To increase fusion opportunities and reduce synchronization cost, we further propose a novel technique that enables fusion of kernels across forward and backward propagation. Finally, to reduce the resource contention caused by the increased number of concurrent, heterogeneous GNN training tasks, we propose an adaptive strategy that groups the tasks and matches their operators according to resource contention. We have conducted extensive experiments, including kernel- and model-level benchmarks. The results show that the proposed methods achieve up to 2.6x performance speedup.
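    The core of the fused-SpMM idea can be sketched in plain Python (TurboMGNN's actual kernels are GPU code; the names `fused_spmm` and the CSR arrays here are illustrative, not the paper's API): two concurrent tasks aggregate over the same graph, so a fused kernel reads the adjacency structure once instead of once per task.

```python
import numpy as np

def fused_spmm(indptr, indices, features_a, features_b):
    """One traversal of the CSR graph produces A @ Ha and A @ Hb."""
    n = len(indptr) - 1
    out_a = np.zeros((n, features_a.shape[1]))
    out_b = np.zeros((n, features_b.shape[1]))
    for v in range(n):
        # Neighbour list is read once, serving both training tasks;
        # an unfused pair of SpMM kernels would read it twice.
        for u in indices[indptr[v]:indptr[v + 1]]:
            out_a[v] += features_a[u]   # aggregation for task A
            out_b[v] += features_b[u]   # aggregation for task B
    return out_a, out_b
```

    On a GPU the same structure lets the fused kernel keep the row offsets and column indices in registers or shared memory across both tasks, which is where the redundant-access savings come from.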

    Feluca: A two-stage graph coloring algorithm with color-centric paradigm on GPU

    In this paper, we propose a two-stage high-performance graph coloring algorithm, called Feluca, which combines a recursion-based method with a sequential spread-based method. In the first stage, Feluca uses a recursive routine to color the majority of vertices in the graph. It then switches to the sequential spread method to color the remaining vertices, avoiding the conflicts of the recursive algorithm. Moreover, the following techniques are proposed to further improve graph coloring performance: i) a new method to eliminate cycles in the graph; ii) a top-down scheme that avoids the atomic operation originally required for color selection; and iii) a novel color-centric coloring paradigm that improves the degree of parallelism for the sequential spread part. All these newly developed techniques, together with further GPU-specific optimizations such as coalesced memory access, make Feluca an efficient parallel graph coloring solution. We have conducted extensive experiments on NVIDIA GPUs. The results show that Feluca achieves 1.76x to 12.98x speedup over the state-of-the-art algorithms.
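    A minimal sequential sketch of the two-stage idea (names and the number of speculative rounds are illustrative, not Feluca's implementation): stage one colors vertices speculatively from a stale snapshot, mimicking a parallel pass that may leave conflicts; stage two resolves the remaining conflicts with a sequential greedy spread, guaranteeing a proper coloring.

```python
def two_stage_coloring(adj, speculative_rounds=2):
    n = len(adj)
    color = [0] * n
    # Stage 1: speculative rounds; every vertex picks the smallest color
    # unused by its neighbours, based on a snapshot from the prior round.
    for _ in range(speculative_rounds):
        snapshot = color[:]
        for v in range(n):
            used = {snapshot[u] for u in adj[v]}
            c = 0
            while c in used:
                c += 1
            color[v] = c
    # Stage 2: sequentially fix vertices that still conflict with a
    # neighbour, which a purely recursive scheme may never resolve.
    for v in range(n):
        if any(color[u] == color[v] for u in adj[v]):
            used = {color[u] for u in adj[v]}
            c = 0
            while c in used:
                c += 1
            color[v] = c
    return color
```

    The sequential pass always terminates with a proper coloring because each fix only consults already-final neighbour colors.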

    Graph Processing on GPUs: A Survey


    Waterwave: A GPU memory flow engine for concurrent DNN training

    Training Deep Neural Networks (DNNs) concurrently is becoming increasingly important for deep learning practitioners, e.g., in hyperparameter optimization (HPO) and neural architecture search (NAS). GPU memory capacity is the impediment that prohibits multiple DNNs from being trained on the same GPU, due to the large memory usage during training. In this paper, we propose Waterwave, a GPU memory flow engine for concurrent deep learning training. Firstly, to address the memory explosion caused by the long time lag between memory allocation and deallocation, we develop an allocator tailored for multiple streams. By making the allocator aware of stream information, allocation is prioritized based on each chunk's synchronization attributes, allowing us to provide usable memory as soon as it is scheduled for release rather than waiting for it to actually be released after GPU computation. Secondly, Waterwave partitions the compute graph into a set of contiguous node groups and then performs finer-grained scheduling, NodeGroup pipeline execution, to guarantee a proper ordering of memory requests. Waterwave can accomplish up to 96.8% of the maximum batch size of solo training. Additionally, in scenarios with high memory demand, Waterwave outperforms existing spatial sharing and temporal sharing by up to 12x and 1.49x, respectively.
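    The stream-aware allocation idea can be illustrated with a toy pool (this is a sketch under assumptions, not Waterwave's real allocator): because GPU streams execute in order, a chunk freed on stream S can be handed back to S immediately, while reuse on a different stream would require an event synchronization, so such chunks are deprioritized.

```python
class StreamAwareAllocator:
    """Toy free-list keyed by the stream that last used each chunk."""

    def __init__(self):
        self.free_chunks = []  # list of (size, last_stream)

    def free(self, size, stream):
        self.free_chunks.append((size, stream))

    def alloc(self, size, stream):
        # Prefer same-stream chunks: safe to reuse right after scheduling,
        # without waiting for the GPU work to really finish.
        same = [c for c in self.free_chunks if c[1] == stream and c[0] >= size]
        cross = [c for c in self.free_chunks if c[1] != stream and c[0] >= size]
        for pool, needs_sync in ((same, False), (cross, True)):
            if pool:
                chunk = min(pool)           # best fit within the pool
                self.free_chunks.remove(chunk)
                return chunk, needs_sync
        return None, False                  # caller falls back to fresh memory
```

    The returned `needs_sync` flag stands in for the event wait a real multi-stream allocator would insert before cross-stream reuse.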

    Deca: A garbage collection optimizer for in-memory data processing

    In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing recomputation and I/O cost in big data processing systems such as Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-living data objects in the heap. These objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes end. When processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency. Extensive experimental studies using both synthetic and real datasets show that, compared to Spark, Deca is able to (1) reduce garbage collection time by up to 99.9%, (2) reduce memory consumption by up to 46.6% and storage space by 23.4%, (3) achieve 1.2x to 22.7x speedup in execution time in cases without data spilling and 16x to 41.6x speedup in cases with data spilling, and (4) provide performance similar to domain-specific systems.
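    A toy region allocator in the spirit of Deca's lifetime groups (the class and record layout are illustrative assumptions, not Deca's Scala/JVM implementation): records sharing an expected lifetime are packed into one byte array, so the garbage collector tracks a single buffer instead of many small objects, and the whole region is released at once when the lifetime ends.

```python
import struct

class LifetimeRegion:
    """Packs fixed-layout (int64, float64) records into one byte array."""
    FMT = "<qd"
    SIZE = struct.calcsize(FMT)

    def __init__(self):
        self.buf = bytearray()

    def put(self, key, value):
        # Serialize the record into the shared buffer: no per-record
        # heap object survives for the GC to trace.
        self.buf += struct.pack(self.FMT, key, value)

    def get(self, i):
        # Decode record i on demand; only a short-lived tuple is created.
        return struct.unpack_from(self.FMT, self.buf, i * self.SIZE)

    def release(self):
        # One deallocation retires the whole lifetime group.
        self.buf = bytearray()
```

    The fixed record format also hints at why Deca's field-oriented pages compress well: homogeneous fields sit at regular offsets.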

    D³-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets

    Since its introduction in 2004 by Google, MapReduce has become the programming model of choice for processing large data sets. Although MapReduce was originally developed for use by web enterprises in large data centers, this technique has gained a lot of attention from the scientific community for its applicability in large-scale parallel data analysis (including geography, high-energy physics, genomics, etc.). So far, MapReduce has mostly been designed for batch processing of bulk data. The ambition of D³-MapReduce is to extend the MapReduce programming model and propose efficient implementations of this model to: i) cope with distributed data sets, i.e. data that span multiple distributed infrastructures or are stored on networks of loosely connected devices; and ii) cope with dynamic data sets, i.e. data that change over time or can be incomplete or only partially available. In this paper, we draw the path towards this ambitious goal. Our approach leverages the Data Life Cycle as a key concept to provide MapReduce for distributed and dynamic data sets on heterogeneous and distributed infrastructures. We first report on our attempts at implementing the MapReduce programming model for Hybrid Distributed Computing Infrastructures (Hybrid DCIs). We present the architecture of a prototype based on BitDew, a middleware for large-scale data management, and Active Data, a programming model for data life cycle management. Second, we outline the challenges in terms of methodology and present our approaches based on simulation and emulation on the Grid'5000 experimental testbed. We conduct performance evaluations and compare our prototype with Hadoop, the industry-reference MapReduce implementation. We present our work in progress on dynamic data sets, which has led us to implement an incremental MapReduce framework. Finally, we discuss our achievements and outline the challenges that remain to be addressed before obtaining a complete D³-MapReduce environment.
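    The incremental MapReduce idea can be sketched as a word count over a dynamic data set (an illustrative sketch, not the D³-MapReduce or BitDew API): when new chunks arrive, only the deltas are mapped and merged into the existing reduce state, instead of re-running the whole job.

```python
from collections import Counter

def map_chunk(chunk):
    """Map phase: emit (word, count) pairs for one input chunk."""
    return Counter(chunk.split())

def merge(state, partial):
    """Reduce phase as a commutative merge, so it composes incrementally."""
    state.update(partial)
    return state

state = Counter()
for chunk in ["a b a", "b c"]:          # initial batch of the data set
    state = merge(state, map_chunk(chunk))
state = merge(state, map_chunk("c a"))  # later, an incremental update
```

    Because the reduce function is a commutative, associative merge, processing the update costs only one map call plus one merge, independent of the data already absorbed.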

    Distribution of HLA-A, -B and -DRB1 Genes and Haplotypes in the Tujia Population Living in the Wufeng Region of Hubei Province, China

    BACKGROUND: The distribution of HLA alleles and haplotypes varies widely between different ethnic populations and geographic areas. Before any genetic marker can be used in a disease-association study, it is therefore essential to investigate allelic frequencies and establish a genetic database. METHODOLOGY/PRINCIPAL FINDINGS: This is the first report of HLA typing in the Tujia group using the Luminex HLA-SSO method. HLA-A, -B and -DRB1 allelic distributions were determined in 124 unrelated healthy Tujia individuals, and haplotypic frequencies and linkage disequilibrium parameters were estimated using the maximum-likelihood method. In total, 10 alleles were detected at the HLA-A locus, 21 alleles at the HLA-B locus and 14 alleles at the HLA-DRB1 locus. The most frequently observed alleles in the HLA class I group were HLA-A*02 (35.48%), A*11 (28.23%) and A*24 (15.73%); HLA-B*40 (25.00%), B*46 (16.13%), and B*15 (15.73%). Among HLA-DRB1 alleles, a high frequency of HLA-DRB1*09 (25.81%) was observed, followed by HLA-DRB1*15 (12.90%) and DRB1*12 (10.89%). The two-locus haplotypes with the highest frequencies were A*02-B*46 (8.47%), followed by A*11-B*40 (7.66%), A*02-B*40 (8.87%), A*11-B*15 (6.45%), A*02-B*15 (6.05%), B*40-DRB1*09 (9.27%) and B*46-DRB1*09 (6.45%). The most common three-locus haplotypes found in the Tujia population were A*02-B*46-DRB1*09 (4.84%) and A*02-B*40-DRB1*09 (4.03%). Fourteen two-locus haplotypes showed significant linkage disequilibrium. A neighbor-joining phylogenetic tree and a principal component analysis based on the allelic frequencies at HLA-A were constructed to compare the Tujia group with twelve other previously reported populations. The Tujia population in the Wufeng region of Hubei Province has the closest genetic relationship with the central Han population, followed by the Shui, the Miao, the southern Han and the northern Han ethnic groups.
    CONCLUSIONS/SIGNIFICANCE: These results will be a valuable source of data for tracing population migration, planning clinical organ transplantation, carrying out HLA-linked disease-association studies and forensic identification.
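    The linkage-disequilibrium parameters mentioned above follow the standard two-locus definition, which a small worked example makes concrete (the frequencies below are illustrative, not the paper's data): D = p(AB) − p(A)·p(B), normalized as D' = D / D_max.

```python
def ld_params(p_ab, p_a, p_b):
    """Standard two-locus LD: raw D and normalized D' (Lewontin)."""
    d = p_ab - p_a * p_b
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, (d / d_max if d_max else 0.0)
```

    In practice the haplotype frequency p(AB) is itself unobserved in unphased genotype data and is obtained by the maximum-likelihood (EM) estimation the abstract refers to; this sketch covers only the LD computation given those estimates.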